Hausdorff clustering 
I. INTRODUCTION

Clustering is the classification of objects into different
groups according to their degree of similarity. A
number of criteria can be used to define this intuitive 
(and central) concept, leading in general to different par- 
titions. Due to this arbitrariness, clustering is an in- 
herently ill-posed problem, as a given data set can be 
partitioned in many different ways without any particu- 
lar reason to prefer one solution to another. It is clear 
that a clustering technique can be profoundly influenced 
by the strategy adopted by the observer and his/her own
ideas and preconceptions on the problem. 

Clustering algorithms can be classified in different ways 
according to the criteria used to implement them:

(A) If, for example, one focuses on the solution, a
fundamental distinction can be drawn between hierarchical
and partitional techniques. Hierarchical methods yield
nested partitions, in which any cluster can be further
divided in order to observe its underlying structure.
Typical examples are the agglomerative and divisive
algorithms that produce dendrograms [3]. On the other
hand, partitional methods provide only one definite
partition, which cannot be analyzed in further detail.

(B) By contrast, if one focuses on data representation, 
two schemes are possible: central and pairwise
clustering. In central clustering, the data are described 
by their explicit coordinates in the feature space and each 
cluster is represented by a prototype (for instance, the 
mean vector and the corresponding spread). In pairwise 
clustering, the data are indirectly represented by a dis- 
similarity matrix, which provides the pairwise compar- 
ison between different elements. Clearly, the choice of 
the measure of dissimilarity is not unique and the per- 
formance of any pairwise method strongly depends on 
it. 



*Electronic address: ester.pantaleo@ba.infn.it



(C) Finally, if one focuses on the strategy of the algo- 
rithm, two approaches can be adopted: parametric and 
non-parametric clustering. Parametric algorithms are 
adopted when some a priori knowledge about the clusters 
is available and this information is used to make some as- 
sumptions on the underlying structure of the data. Vice 
versa, the non-parametric approach to clustering may 
represent the optimal strategy when there is no prior 
knowledge about the data. In general, these methods 
follow some local criterion for the construction of the 
clusters, such as, for instance, the identification of high 
density regions in the data space.

From the mathematical point of view, given a set of
objects $S = \{s\}$, an allocation function
$m : s \in S \to \{1, 2, \ldots, k\}$ must be defined so that
$m(s)$ is the class label and $k$ the total number of
clusters (which we assume to be finite for simplicity);
$k$ may be chosen a priori or computed within the algorithm.
The aim of a clustering procedure is to select, among all
possible allocation functions, the one performing the best
partition of the set $S$ into subsets
$\Omega_a = \{s \in S : m(s) = a\}$ ($a = 1, \ldots, k$),
relying on some measure of similarity. The space of any
clustering solution is the set $\mathcal{M}$ of all possible
allocation functions.

In this article we will focus on a class of clustering tech- 
niques called linkage algorithms. Linkage algorithms are 
hierarchical, agglomerative and non-parametric methods 
that merge, at each step, the two clusters with the small- 
est dissimilarity, starting from clusters made of a sin- 
gle element, ending up in one cluster collecting all data. 
We will analyze the so-called single and complete link- 
age methods and will introduce a linkage method based 
on Hausdorff's distance. We will use as a mathematical 
definition of dissimilarity a suitable metric in the space 
of the partitions of the given data set. Notice that
in general a similarity measure need not be a distance in 
the mathematical sense; on the other hand, if one aims 
at clustering in a parameter space, a distance could be 
the best choice because it does not introduce any degree 
of arbitrariness. It is worth stressing that alternative 
philosophies are also possible, in which the clustering
algorithm is governed by purely topological notions and
unveils efficient collective dynamics in animal behavior.
A comparison among these methods belongs to the
realm of statistical mechanics and is beyond the scope of 
this article. See Q for an excellent discussion. 

We will focus on finite sets and clusters, although we 
will keep our analysis on the metric features of the
relevant spaces as general as possible. We will start in
Sec. II by reviewing and clarifying some mathematical
concepts concerning distance and linkage methods, focusing
on the single and complete linkage algorithms in Sec. III.
The Hausdorff distance and the related clustering procedure
will be introduced in Sec. IV. Section V is devoted
to the comparison of the different methods on some data
sets, including both a toy problem and a case study on
financial time series. Some conclusions are drawn in
Sec. VI.

II. PRELIMINARIES
A. Distances and pseudodistances 

We start by recalling the mathematical definition of
distance. Given a set $S$, a distance (or a metric) $\delta$
is a non-negative application

$$\delta : S \times S \to \mathbb{R}_+ \tag{1}$$

with $\mathbb{R}_+ = [0, \infty)$, endowed with the following
properties, valid $\forall x, y \in S$:

$$\delta(x, y) = 0 \iff x = y, \tag{2}$$
$$\delta(x, y) = \delta(y, x), \tag{3}$$
$$\delta(x, y) \le \delta(x, z) + \delta(y, z), \quad \forall z \in S. \tag{4}$$

Incidentally, notice that symmetry (3), as well as
non-negativity, are not independent assumptions, but easily
follow from (2) and the triangular inequality (4). If the
triangular inequality is written as

$$\delta(x, y) \le \delta(x, z) + \delta(z, y), \quad \forall z \in S, \tag{5}$$

as is often the case, symmetry (3) must be independently
postulated. We will henceforth denote a metric space by
$(S, \delta)$.

An application (1) is a pseudometric [10] if property
(2) is weakened:

$$x = y \implies \delta(x, y) = 0. \tag{6}$$

In such a case, distinct elements of the set $S$ can be at
a null distance. A set endowed with a (pseudo)metric is
called a (pseudo)metric space.
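As an illustration of properties (2)-(4) and of the weakened condition (6), the following minimal Python sketch checks them by brute force on a finite set. The function name `check_metric` and the two example sets are assumptions introduced here for illustration only.

```python
from itertools import product

def check_metric(points, delta, tol=1e-12):
    """Report which of the properties (2)-(4) hold for delta on a finite set."""
    pairs = list(product(points, repeat=2))
    # (2): delta(x, y) = 0 if and only if x = y
    identity = all((delta(x, y) <= tol) == (x == y) for x, y in pairs)
    # (3): symmetry
    symmetry = all(abs(delta(x, y) - delta(y, x)) <= tol for x, y in pairs)
    # (4): triangular inequality in the form used in the text
    triangle = all(delta(x, y) <= delta(x, z) + delta(y, z) + tol
                   for x, y, z in product(points, repeat=3))
    return {"identity": identity, "symmetry": symmetry, "triangle": triangle}

# The usual distance on the real line is a metric.
line = [0.0, 1.0, 2.5]
euclid = lambda x, y: abs(x - y)

# A pseudometric in the sense of (6): it ignores the second coordinate,
# so the distinct points (0, 0) and (0, 1) are at a null distance.
plane = [(0, 0), (0, 1), (2, 0)]
first_coord = lambda p, q: abs(p[0] - q[0])

print(check_metric(line, euclid))
print(check_metric(plane, first_coord))
```

For the second example only the "identity" entry is expected to fail, which is exactly the situation described by (6).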

B. Linkage algorithms 

Linkage algorithms are hierarchical methods, yielding
a clustering structure that is usually displayed in the form
of a tree or dendrogram [3]. We will adopt an agglomerative
algorithm, where the clusters are linked through an
iterative process, whose successive steps are the following.
Given a data set $S$, made up of $n$ elements, at
the first level (leaves of the dendrogram) the number of
classes is equal to the number of elements. We assume
(without loss of generality) that $S$ is a metric space [24].
At the first iteration the two closest elements are
clustered together, reducing the number of classes to $n - 1$
(if more than two elements are at the closest distance,
we pick a random pair among them). At the second
iteration one has to tackle the subtler problem of defining
a distance between the remaining elements of $S$ and the
first cluster formed. When this is done, the distances are
recomputed and the two closest objects are joined. At the
following iterations one has to tackle the much more subtle
problem of defining a distance among classes. Clearly,
this can be done in a variety of different ways and entails
further elements of arbitrariness. Assume that this
procedure can be carried out consistently. After $n - 1$
merging steps, all the points are grouped together in one
cluster, corresponding to the whole data set. The
agglomerative procedure is reversed in a straightforward way
in the so-called divisive approach: starting from one single
cluster, this is iteratively divided into smaller and smaller
ones, until single elements are obtained.

The most commonly used algorithms of this type are 
the "complete" and the "single" linkage, that differ in the 
definition of "distance" between subsets of points. In the 
next section we will briefly review these two algorithms. 
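The agglomerative procedure described above can be sketched in a few lines of Python. This is only a schematic illustration: the function `agglomerate`, the dissimilarity `gap` and the five points are hypothetical, and ties are broken by order rather than at random.

```python
def agglomerate(n, set_distance):
    """Merge n singletons step by step; return the merges as (A, B, height)."""
    clusters = [frozenset([i]) for i in range(n)]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest dissimilarity
        h, i, j = min(((set_distance(a, b), i, j)
                       for i, a in enumerate(clusters)
                       for j, b in enumerate(clusters) if i < j),
                      key=lambda t: t[0])
        a, b = clusters[i], clusters[j]
        merges.append((a, b, h))
        # replace the two merged clusters by their union
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(a | b)
    return merges

# Hypothetical example: five points on a line; the dissimilarity between
# two clusters is taken here as the smallest gap between their points.
xs = [0.0, 1.0, 1.5, 5.0, 5.2]
gap = lambda A, B: min(abs(xs[i] - xs[j]) for i in A for j in B)

merges = agglomerate(len(xs), gap)
for a, b, h in merges:
    print(sorted(a), sorted(b), round(h, 3))
```

Starting from $n$ singletons, the loop performs $n - 1$ merges. The choice of `set_distance` is the only place where the different linkage methods enter.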




FIG. 1: For a set $A$ containing more than one element,
$d_c(A, A) \neq 0$ and neither (2) nor (6) is valid.



III. COMPLETE AND SINGLE LINKAGE 

A. "Distances" 

Linkage algorithms differ from each other in the
similarity criteria used to build the clusters. An
optimal criterion would rely on a metric $d$ defined on the
subsets of the parent space $S$:

$$d : \mathcal{K}(S) \times \mathcal{K}(S) \to \mathbb{R}_+ , \tag{7}$$

where $\mathcal{K}(S)$ is the collection of all the nonempty
compact subsets of $S$. (We restrict the metric to the above
class of subsets in order to avoid some pathologies, see later.)
Such a metric can be defined in a natural way by using
the original metric $\delta$ defined on $S$. If $A$ and $B$
are two nonempty compact subsets of $S$, the complete and
single linkage ansätze make use of the following "distances"

$$d_c(A, B) = \sup_{a \in A,\, b \in B} \delta(a, b), \tag{8}$$

$$d_s(A, B) = \inf_{a \in A,\, b \in B} \delta(a, b), \tag{9}$$

respectively. However, it is easy to check that neither
one of the above functions is a bona fide distance in the
mathematical sense. The function (8) is obviously
non-negative and symmetric, so (3) is valid. Moreover, the
triangular inequality (4) is satisfied:

$$\begin{aligned}
d_c(A, B) &= \sup_{a \in A,\, b \in B} \delta(a, b) \\
&\le \sup_{a \in A,\, b \in B,\, c \in C} \bigl(\delta(a, c) + \delta(b, c)\bigr) \\
&\le \sup_{a \in A,\, b \in B,\, c \in C} \delta(a, c) + \sup_{a \in A,\, b \in B,\, c \in C} \delta(b, c) \\
&= \sup_{a \in A,\, c \in C} \delta(a, c) + \sup_{b \in B,\, c \in C} \delta(b, c) \\
&= d_c(A, C) + d_c(B, C).
\end{aligned} \tag{10}$$

Yet, property (2) is not valid in general, as for a set $A$
made up of more than one element, the distance of $A$ from
itself equals the distance between its farthest objects:

$$d_c(A, A) \neq 0. \tag{11}$$

This is graphically displayed in Fig. 1 and shows that (8)
is not even a pseudodistance.

Intuitively, this is not an important issue for "small"
sets, but it becomes an increasingly serious problem for
"larger" sets. Clearly, the notions of "small" and "large"
must be properly defined: for a compact metric space
of size $R$, we may say that a subset of size $r$ is small if
$r \ll R$ (say by at least one order of magnitude) [26]. This
situation will directly concern us in the next sections.

Consider now the second function (9), which is
non-negative and symmetric, so (3) is valid. Notice that the
pseudometric property is satisfied

$$A = B \implies d_s(A, B) = 0, \tag{12}$$

although the converse is not true [so that property (2) is
not valid]: consider for instance two sets $A$ and $B$ such
that $A \cap B \neq \emptyset$: in this case $d_s(A, B) = 0$,
as this is, by definition, the distance $\delta$ of a common
element from itself. Finally, the triangular inequality (4)
is not verified, as can be easily inferred by looking at the
counterexample in Fig. 2, for which

$$d_s(A, B) > d_s(A, C) + d_s(B, C). \tag{13}$$

The function $d_s$ is therefore neither a metric nor a
pseudometric. As we shall see in Sec. III C, this problem
gives rise to the chaining effect.



FIG. 2: Three sets $A$, $B$, $C$, each containing two
elements, for which $d_s$ does not satisfy the triangular
inequality: $d_s(A, B) > d_s(A, C) + d_s(B, C)$.
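Both failures discussed in this subsection, Eqs. (11) and (13), can be reproduced numerically. The sketch below uses hypothetical planar point sets (they are not the sets drawn in Figs. 1 and 2) and the Euclidean distance as the underlying metric $\delta$.

```python
from math import dist

def d_complete(A, B):
    """Complete linkage "distance", Eq. (8), for finite sets."""
    return max(dist(a, b) for a in A for b in B)

def d_single(A, B):
    """Single linkage "distance", Eq. (9), for finite sets."""
    return min(dist(a, b) for a in A for b in B)

# Hypothetical sets of two points each: C "bridges" A and B.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(10.0, 0.0), (10.0, 1.0)]
C = [(0.5, 0.0), (9.5, 0.0)]

# Eq. (11): a set with more than one element is not at zero
# complete-linkage "distance" from itself.
print(d_complete(A, A))

# Eq. (13): the triangular inequality fails for the single linkage.
print(d_single(A, B), d_single(A, C) + d_single(B, C))
```

Here `d_single(A, B)` is much larger than the sum of the two other terms because each point of $C$ lies close to one of the two sets, which is the mechanism behind the chaining effect.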



B. Finite sets 

We will look explicitly at the practical case in which
(8) and (9) are evaluated on finite sets. It is therefore
convenient to specialize the formulas of the preceding
section to such a situation. Let $A = \{a_i\}_{i=1,\ldots,I}$
and $B = \{b_j\}_{j=1,\ldots,J}$ be two finite sets and

$$\delta_{ij} = \delta(a_i, b_j) \tag{14}$$

the distance between any two elements of $A$ and $B$. The
$\delta_{ij}$'s can be arranged in an $I \times J$ "distance"
matrix. Equations (9) and (8) then read

$$d_s(A, B) = \min_{i \in A} \min_{j \in B} \delta_{ij}, \tag{15}$$

$$d_c(A, B) = \max_{i \in A} \max_{j \in B} \delta_{ij}, \tag{16}$$

for the single and complete linkage algorithms,
respectively. In practice, this amounts to determining the
smallest and the largest entry of the distance matrix,
respectively, a task that can be performed in polynomial
time. These formulas will be applied in the following
examples.
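A direct transcription of Eqs. (14)-(16) may help fix ideas. In this sketch the helper names and the sample points are assumptions, and the Euclidean distance plays the role of $\delta$.

```python
from math import dist

def distance_matrix(A, B):
    """I x J matrix with entries delta_ij = delta(a_i, b_j), Eq. (14)."""
    return [[dist(a, b) for b in B] for a in A]

def d_single(D):
    """Smallest entry of the distance matrix, Eq. (15)."""
    return min(min(row) for row in D)

def d_complete(D):
    """Largest entry of the distance matrix, Eq. (16)."""
    return max(max(row) for row in D)

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (4.0, 3.0), (6.0, 0.0)]
D = distance_matrix(A, B)
print(d_single(D), d_complete(D))
```

Both functions scan the $I \times J$ entries once, so the cost is proportional to the size of the matrix.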



C. Comments 

It is worth commenting on the features of the two
clustering ansätze introduced, emphasizing their limits and
positive aspects. The single linkage algorithm tends to
yield elongated clusters, which are sometimes difficult to
understand and poorly significant [3]: this is known as the
chaining effect. On the contrary, the complete linkage
has the advantage of clustering "compact" groups and
produces well localized classes. In general, the partitions
obtained using it are more significant. Its major
disadvantage is that it does not set equal to zero the
distance of a "compact" set from itself [see Eq. (11) and
Fig. 1], performing de facto a coarse graining. In a few
words, $d_c$ looks at the data points with a "minimal
resolution" (that is also, unfortunately, cluster dependent)
and is unable to recognize the complexity of a finely
structured cluster and to extract "nested" clusters, such as
those displayed in Fig. 3 [11]. Notice that, by contrast,
such "nested" clusters are very efficiently detected by the
single linkage algorithm, as shown in Fig. 3.



FIG. 3: Concentric clusters analyzed in terms of the single
linkage algorithm. This procedure is very efficient in
discriminating nested structures of this kind. For the same
reason, it suffers from the so-called "chaining effect."

In the next section we shall introduce a procedure that 
is somewhat "in between" single and complete linkage 
and makes use of an underlying bona fide distance. This 
will have some advantages, also from a conceptual
viewpoint, as it enables one to rest on a firm mathematical
background.



IV. HAUSDORFF DISTANCE AND 
HAUSDORFF LINKAGE 

In the light of the discussion of the preceding section,
it appears convenient to approach the clustering problem
from a "neutral" perspective, by looking for a linkage
algorithm based on a well-defined mathematical similarity
criterion. In order to do this, we will use a distance
function introduced by Hausdorff.



A. Hausdorff distance 

Given a metric space $(S, \delta)$, the distance between a
point $a \in S$ and a (nonempty and compact) subset
$B \in \mathcal{K}(S)$ is naturally given by

$$d(a; B) = \inf_{b \in B} \delta(a, b). \tag{17}$$




FIG. 4: Hausdorff distance between two sets $A$ (a square)
and $B$ (a rectangle). The open neighborhoods $N_{r_1}(A)$ and
$N_{r_2}(B)$ are shaded, $r_1 = d(B; A)$, $r_2 = d(A; B)$. The
Hausdorff distance is $r_2$.

Given a subset $A \in \mathcal{K}(S)$, consider the function

$$d(A; B) = \sup_{a \in A} d(a; B) = \sup_{a \in A} \inf_{b \in B} \delta(a, b), \tag{18}$$

that measures the largest distance $d(a; B)$, with $a \in A$.
Note that here the strategy is opposite to that used with
the single linkage "distance" (9), where one considers
instead the smallest distance $d(a; B)$, with $a \in A$. The
function (18) is not symmetric, $d(A; B) \neq d(B; A)$, and
therefore is not a bona fide distance, as it does not
satisfy (3). The Hausdorff distance [12] between two sets
$A, B \in \mathcal{K}(S)$ is defined as the largest between
the two numbers:

$$d_H(A, B) = \max\{d(A; B),\, d(B; A)\}, \tag{19}$$

namely,

$$d_H(A, B) = \max\Bigl\{\sup_{a \in A} \inf_{b \in B} \delta(a, b),\; \sup_{b \in B} \inf_{a \in A} \delta(a, b)\Bigr\}, \tag{20}$$

that is clearly symmetric and satisfies all axioms (2)-(4).
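For finite sets, Eqs. (18) and (19) translate almost literally into code. The sketch below is only an illustration: the function names and the two sets are chosen here to make the asymmetry of (18) visible, with the Euclidean distance as $\delta$.

```python
from math import dist

def directed(A, B):
    """d(A;B) = sup over a in A of inf over b in B of delta(a, b), Eq. (18)."""
    return max(min(dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """d_H(A,B) = max{d(A;B), d(B;A)}, Eq. (19)."""
    return max(directed(A, B), directed(B, A))

# A is contained in B, so d(A;B) vanishes while d(B;A) does not.
A = [(0.0, 0.0)]
B = [(0.0, 0.0), (3.0, 4.0)]
print(directed(A, B), directed(B, A), hausdorff(A, B))
```

The example shows why the maximum in (19) is needed: taking only one of the two directed quantities would assign a null distance to a set and any of its proper supersets.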

It is worth discussing a bit more the mathematical
features of $d_H$. This will help us grasp its interesting
properties, towards physical applications.

Given a set $A \in \mathcal{K}(S)$ and a positive real number
$r > 0$, define the open $r$-neighborhood of $A$ as:

$$N_r(A) = \{y : d(y; A) < r\}. \tag{21}$$

The Hausdorff distance between two sets
$A, B \in \mathcal{K}(S)$ can be reexpressed as

$$d_H(A, B) = \inf\{r : A \subset N_r(B) \text{ and } B \subset N_r(A)\}. \tag{22}$$

Indeed

$$\begin{aligned}
d_H(A, B) &= \inf\{r : A \subset N_r(B),\; B \subset N_r(A)\} \\
&= \inf\bigl(\{r : A \subset N_r(B)\} \cap \{r : B \subset N_r(A)\}\bigr) \\
&= \max\bigl\{\inf\{r : A \subset N_r(B)\},\; \inf\{r : B \subset N_r(A)\}\bigr\}
\end{aligned} \tag{23}$$

and since

$$\inf\{r : A \subset N_r(B)\} = \sup_{x \in A} \inf\{r : x \in N_r(B)\} = \sup_{x \in A} \inf_{y \in B} \delta(x, y), \tag{24}$$

and analogously for $\inf\{r : B \subset N_r(A)\}$, one gets again
(20). Stated differently, the Hausdorff distance can also
be defined as the smallest radius $r$ such that $N_r(A)$
contains $B$ and at the same time $N_r(B)$ contains $A$.

In words, the Hausdorff distance between $A$ and $B$ is
the smallest positive number $r$, such that every point of
$A$ is within distance $r$ of some point of $B$, and every
point of $B$ is within distance $r$ of some point of $A$. The
geometrical meaning of the Hausdorff distance is best
understood by looking at an example, such as that in
Fig. 4. We emphasize that the Hausdorff metric on the
subsets of $S$ is defined in terms of the metric $\delta$ on the
points of $S$.

The Hausdorff distance enjoys a number of interesting
features that are worth discussing. We have defined $d_H$
only on nonempty compact sets for the following reasons.
Consider for example the real line. Then, by adopting the
convention $\inf \emptyset = \infty$ [27], one gets
$\forall x$, $d_H(\emptyset, \{x\}) = \infty$,
which is not allowed by any definition of metric. This
suggests that we should restrict our attention to nonempty
sets. Moreover, $d_H(\{0\}, [0, \infty)) = \infty$, which is again
not allowed. We then restrict the use of $d_H$ only to bounded
sets. Finally, the Hausdorff distance between two not
equal sets could vanish [which would make $d_H$ a
pseudometric, see (6)]: for instance $d_H((0, 1), [0, 1]) = 0$.
Therefore we will restrict the application of $d_H$ only to
closed sets.

More generally, it is easy to prove the following
Theorem: The Hausdorff function $d_H$ is a metric on the
set $\mathcal{K}(S)$. Moreover, if $(S, \delta)$ is a complete
metric space, then the space $(\mathcal{K}(S), d_H)$ is also
complete.

Although of an abstract nature, this is of physical
significance, as it enables one to be confident about the
metric properties of $\mathcal{K}(S)$ even for fine-structured
clusters. Notice that the property of completeness could not
even be conceived for the "distance" $d_c$ used for the
complete linkage in the last section. In conclusion,

$$d_H : \mathcal{K}(S) \times \mathcal{K}(S) \to \mathbb{R}_+ \tag{25}$$

is a complete metric. In the cases of interest, $S$ will be a
complete metric space, e.g., a Euclidean space.

We close this section with two remarks. First, if the
data set is finite and consists of $N$ elements, all distances
can be arranged in an $N \times N$ matrix $\delta_{ij}$ and
Eq. (20) reads

$$d_H(A, B) = \max\Bigl\{\max_{i \in A} \min_{j \in B} \delta_{ij},\; \max_{j \in B} \min_{i \in A} \delta_{ij}\Bigr\}, \tag{26}$$

which is a very handy expression, as it amounts to
finding the minimum distance in each row (column) of the
distance matrix, then the maximum among the minima.
The two numbers are finally compared and the largest
one is the Hausdorff distance. This sorting algorithm is
efficient and can be easily implemented.
Second, $\forall A, B \in \mathcal{K}(S)$

$$d_s(A, B) \le d_H(A, B) \le d_c(A, B). \tag{27}$$

This is a simple consequence of (20) and the definitions
(8) and (9) [or (15) and (16) in the discrete case].

In some sense, $d_c$ overestimates the distance between two
given sets, essentially because it includes in such a
distance the very "size" (11) of the set (see Fig. 1). On the
other hand, $d_s$ underestimates it. As we shall see, this
has important consequences when one clusters complex
and/or large sets.
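The two remarks above can be checked numerically. The following sketch evaluates Eqs. (15), (16) and (26) on index subsets of an $N \times N$ distance matrix and verifies the ordering (27). The random points, the seed and the index sets are arbitrary choices made here for illustration.

```python
import random
from math import dist

def d_s(D, A, B):
    """Single linkage, Eq. (15), for index sets A and B."""
    return min(D[i][j] for i in A for j in B)

def d_c(D, A, B):
    """Complete linkage, Eq. (16), for index sets A and B."""
    return max(D[i][j] for i in A for j in B)

def d_H(D, A, B):
    """Hausdorff distance, Eq. (26), for index sets A and B."""
    rows = max(min(D[i][j] for j in B) for i in A)   # max over i of min over j
    cols = max(min(D[i][j] for i in A) for j in B)   # max over j of min over i
    return max(rows, cols)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(12)]
D = [[dist(p, q) for q in pts] for p in pts]          # N x N distance matrix

A, B = [0, 1, 2, 3], [4, 5, 6, 7, 8]
print(d_s(D, A, B) <= d_H(D, A, B) <= d_c(D, A, B))   # ordering of Eq. (27)
```

Each of the three functions visits the $|A| \times |B|$ block of the matrix a fixed number of times, so the Hausdorff distance costs essentially the same as the other two.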



B. Hausdorff linkage 

We shall take the Hausdorff distance as our dissimilarity
measure. This distance naturally translates into a linkage
algorithm: at the first level each element is a cluster,
the Hausdorff distance between any pair of points reads

$$d_H(\{i\}, \{j\}) = \delta_{ij} \tag{28}$$

and coincides with the underlying metric. The two
elements of $S$ at the shortest distance are then joined
together in a single cluster. The Hausdorff distance matrix
is recomputed, considering the two joined elements as a
single set. This iterative process goes on until all points
belong to a single final cluster.
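The steps above can be sketched as follows. This is a naive illustration that recomputes all cluster-to-cluster Hausdorff distances at every step; the function names and the five sample points are assumptions, and ties are broken by order.

```python
from math import dist

def hausdorff(D, A, B):
    """Eq. (26) for index sets A and B of the distance matrix D."""
    return max(max(min(D[i][j] for j in B) for i in A),
               max(min(D[i][j] for i in A) for j in B))

def hausdorff_linkage(D):
    """Return the merges (A, B, height) of the agglomerative procedure."""
    clusters = [frozenset([i]) for i in range(len(D))]
    merges = []
    while len(clusters) > 1:
        # pair of clusters at the shortest Hausdorff distance
        h, a, b = min(((hausdorff(D, a, b), a, b)
                       for k, a in enumerate(clusters)
                       for b in clusters[k + 1:]),
                      key=lambda t: t[0])
        merges.append((a, b, h))
        # the two joined clusters are treated as a single set from now on
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

pts = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.5), (9.0, 0.0)]
D = [[dist(p, q) for q in pts] for p in pts]
merges = hausdorff_linkage(D)
for a, b, h in merges:
    print(sorted(a), sorted(b), round(h, 3))
```

At the first step the heights are the entries $\delta_{ij}$ themselves, in agreement with Eq. (28). The list of merge heights is what one would plot on the vertical axis of a dendrogram.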

Clearly, when evaluating distances among single
elements (points), the three functions $d_H$, $d_s$, $d_c$
yield the same result. The output of the single linkage
algorithm will differ very quickly from the other two, due to
the drawbacks of the chaining effect. On the other hand,
the differences between Hausdorff and complete linkage
will become apparent only later in the clustering process.
This is a consequence of the fact that the functions $d_H$
and $d_c$ yield the same value when evaluated on a single
element $\{a\}$ and a composite set $B$. Indeed, from (20):

$$\begin{aligned}
d_H(\{a\}, B) &= \max\Bigl\{\sup_{x \in \{a\}} \inf_{y \in B} \delta(x, y),\; \sup_{y \in B} \inf_{x \in \{a\}} \delta(x, y)\Bigr\} \\
&= \max\Bigl\{\inf_{y \in B} \delta(a, y),\; \sup_{y \in B} \delta(a, y)\Bigr\} \\
&= \sup_{y \in B} \delta(a, y) \\
&= d_c(\{a\}, B).
\end{aligned} \tag{29}$$

As a consequence of this property, at the lowest levels the
Hausdorff linkage will yield a partition that is very
similar to that obtained by the complete linkage algorithm.
As the clustering procedure goes on, the two methods will
differ from each other, because of their different criteria in
evaluating distances, leading to different aggregations of
more complex classes. It is at this point that the output
of the complete linkage becomes less reliable, as a
consequence of (11) and (27). As discussed after Eq. (11), we
expect this problem to become serious for "large" sets,
of size comparable to that of the parent space.
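Equation (29) is easy to verify on a small example. In the sketch below the point $a$ and the sets $A$, $B$ are hypothetical, and the Euclidean distance is used as $\delta$.

```python
from math import dist

def d_c(A, B):
    """Complete linkage "distance", Eq. (8), for finite sets."""
    return max(dist(a, b) for a in A for b in B)

def d_H(A, B):
    """Hausdorff distance, Eq. (20), for finite sets."""
    return max(max(min(dist(a, b) for b in B) for a in A),
               max(min(dist(a, b) for a in A) for b in B))

B = [(1.0, 0.0), (2.0, 2.0), (5.0, 1.0)]
a = (0.0, 0.0)

# Single element against a composite set: the two functions coincide.
print(d_H([a], B) == d_c([a], B))

# Two composite sets: the two functions generally differ.
A = [(0.0, 0.0), (0.0, 3.0)]
print(d_H(A, B), d_c(A, B))
```

In the second case $d_c$ is determined by the farthest pair of points, while $d_H$ only needs every point of one set to have some neighbor in the other, which is why $d_H(A, B) < d_c(A, B)$ here.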

The partitions obtained by the Hausdorff linkage
algorithm will be intermediate between those obtained by the
other two procedures. We shall now compare the three 
clustering methods, first on an artificial set of points in a 
two dimensional Euclidean space, then on financial time 
series. 

A final comment is in order. Given a distance matrix, 
any clustering procedure will yield a tree and an ultra- 
metric, entailing a loss of information on the data set. 
However, this appears necessary and is inherent in any 
clustering procedure. 



V. APPLICATIONS 

A. Two-dimensional data set 

Let us analyze the effect of the single, complete and 
Hausdorff linkage algorithms on the data set shown in 
Fig. 5. This is a discrete set of points in the plane,
resembling a pair of "glasses" (each one made up of 31
points) connected by a short horizontal "bar" (5 points)
and two "pupils" (each one made up of 2 points), for a
total of n = 71 points.

The dendrograms generated by the three algorithms
are shown in Fig. 6. The chaining effect of the single
linkage is apparent. This can be an advantage if one wants
to bring to light the presence of a "continuous" line of
points; it is a drawback in a parameter space because
data characterized by opposite values of the parameter
on the abscissa in Fig. 5 are clustered together. As
anticipated, a discrimination between the two other methods
is more difficult. However, as discussed after Eq. (11),
the differences should become apparent for "large" sets, 
of size comparable to that of the parent space: for a par- 
ent space made up of n = 71 (approximately linearly 
distributed) points, we expect this effect to show up for 
sets made up of more than 7 points, as one can see in 
Fig. 1 

A proper way to cut the dendrograms could be to
search for a stable partition among the whole hierarchy
yielded by the algorithms, in correspondence to an
approximately constant value of the cluster entropy in a
certain range of the dissimilarity measure $d$ [13]

$$S(d) = -\sum_{k=1}^{N_d} P_d(k) \ln P_d(k), \tag{30}$$

where $P_d(k)$ is the fraction of elements belonging to
cluster $k$, and $N_d$ the number of clusters at level $d$ in
the dendrogram. The complete and Hausdorff entropies
corresponding to the dendrograms in Fig. 6 are shown in
Fig. 7. We emphasize that, for the case at hand, the data set
was intentionally chosen so that one cannot expect an
obvious partition into "sensible" clusters. For this very
reason, the entropies in Fig. 7 display no "plateau." The
optimal cut is then chosen according to a visual
optimization of the clustering solution. Figure 8 shows the
selected partitions: while the single linkage yields a clear 
chaining effect, both complete and Hausdorff methods 
share the positive aspect of clustering rather "compact" 
sets. Moreover, all other clusters being roughly similar, 
the Hausdorff procedure is also able to discriminate the 
two-points "pupils" in Fig.O in this respect it enjoys the 
positive spin-offs of the single linkage algorithm. On the 
other hand, the complete linkage algorithm clusters each 
"pupil" together with a part of its nearest "glass." 



FIG. 5: A two-dimensional toy sample: a pair of "glasses" 
(each one made up of 31 points) connected by a short hori- 
zontal "bar" (5 points) and two "pupils" (each one made up 
of 2 points), for a total of n = 71 points. 

This example aims at showing how difficult it can be 
to discriminate between complete and Hausdorff linkage: 
while the single linkage will obviously suffer from the 
chaining effect (and will cluster points at the opposite 
sides of the figure), the other two procedures will perform 
in a similar fashion at the beginning, yielding different 
clusters only when the classes become more complex. 
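The cluster entropy of Eq. (30), used above to look for stable partitions, depends only on the cluster sizes at a given level of the dendrogram. A minimal sketch follows, in which the function name and the label lists are illustrative assumptions.

```python
from math import log

def cluster_entropy(labels):
    """S = -sum_k P(k) ln P(k), with P(k) the fraction of elements in cluster k."""
    n = len(labels)
    counts = {}
    for k in labels:
        counts[k] = counts.get(k, 0) + 1
    return sum(-(c / n) * log(c / n) for c in counts.values())

print(cluster_entropy([0] * 8))           # a single cluster
print(cluster_entropy([0, 0, 1, 1]))      # two clusters of equal size
print(cluster_entropy(list(range(8))))    # eight singletons
```

The entropy vanishes when all elements belong to one cluster and reaches $\ln n$ when each of the $n$ elements is a cluster of its own, so a plateau of $S(d)$ between these extremes signals a partition that is stable over a range of $d$.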



B. Financial Data 

The use of clustering algorithms can improve the
reliability of a financial portfolio [14]. Here we apply the
Hausdorff algorithm to the analysis of financial time
series. In particular, we focus on the N = 30 shares
composing the DJIA index, collecting the daily closure
prices of its stocks for a period of 5 years (1998-2002).
The companies of the DJIA stock market are reported in
Appendix A, together with the corresponding industrial
areas.

We consider the temporal series of the logarithm of the 




FIG. 6: Dendrograms generated by the single, complete and Hausdorff linkage, for the data set of Fig. 5.




FIG. 7: Cluster entropies of the dendrograms of Fig. 6.
Dashed black line: single; continuous red line: Hausdorff;
blue dot-dashed line: complete.



ratio of two consecutive closure prices

$$X(t) = \ln \frac{P(t)}{P(t-1)}, \tag{31}$$



where $P(t)$ is the closure price of a stock at day $t$. Both
$P$ and $X$ are very irregular functions of time, as one
can see in Fig. 9, which displays the typical behavior of
a stock value (MSFT) for the investigated time period.
In order to use the linkage algorithm, we quantify the
degree of similarity between two time series $X$ and $Y$ by
means of the correlation coefficients computed over the
investigated time period:

$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}, \tag{32}$$

where $E$ is the expectation value over the time interval
of interest (one year in our case), $\mu_X = E[X]$ and
$\sigma_X = \sqrt{E[X^2] - \mu_X^2}$. Figure 10 shows the correlation
matrix $\rho(X, Y)$ computed for the year 1998: each element is
displayed in a color scale ranging from blue (minimum
value) to red (maximum value). It is worth stressing




FIG. 8: Clustering results for single (up), complete (middle),
and Hausdorff (bottom) linkage. Objects belonging to the
same cluster share the same symbol: for example the complete
algorithm groups the "pupil" on the left (red full circles) with
14 points belonging to its nearest "glass" (red full circles).



that almost all correlation coefficients are positive, with 
values not too close to 1, thus confirming that, in many 
cases, stocks belonging to the same market do not move 
independently from each other, but rather share a similar 
temporal behavior. 
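Equations (31) and (32) can be made concrete with a short sketch. The two price series below are made-up numbers used only for illustration, and the expectation $E$ is taken as the plain average over the available days.

```python
from math import log, sqrt

def log_returns(prices):
    """X(t) = ln[P(t) / P(t-1)], Eq. (31)."""
    return [log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

def correlation(x, y):
    """rho(X, Y) of Eq. (32), with E the average over the time window."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Made-up closure prices of two hypothetical stocks.
p = [100.0, 102.0, 101.0, 105.0, 104.0]
q = [50.0, 50.5, 50.2, 52.0, 51.8]
x, y = log_returns(p), log_returns(q)
print(correlation(x, y))
```

A series of $T$ prices yields $T - 1$ log-returns, and the correlation is bounded between $-1$ and $+1$ whatever the scale of the prices.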

The metric function we adopted to quantify the time
synchronicity between two stocks is the following [15, 16]:



FIG. 9: Time evolution of the closure price $P(t)$ and the
logarithm of the ratio of consecutive closure prices $X(t)$ [see
Eq. (31)] of a stock value (MSFT), for the period 1998-2002.




FIG. 10: Correlation matrix $\rho(X, Y)$ computed for the year
1998: each element is displayed in a color scale ranging from
blue (minimum value) to red (maximum value).




$$d(X, Y) = \sqrt{2\,(1 - \rho(X, Y))}. \tag{33}$$

The distance (33) is a proper metric in the parent space,
ranging from 0 for perfectly correlated series
[$\rho(X, Y) = +1$] to 2 for anticorrelated stocks
[$\rho(X, Y) = -1$]. The representative points lie on a
hypersphere and $d(X, Y)$ measures the Euclidean (and not the
geodesic) distance between $X$ and $Y$. Figure 11 shows the
distance matrix $d(X, Y)$ computed for the year 1998: each
element is displayed in a color scale ranging from blue
($d = 0$) to red ($d = \sqrt{2}$). The tree structure obtained
for this set was
already scrutinized and discussed in Ref. Q. We shall 
focus here on the features of the dendrograms. 
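The mapping (33) from correlations to distances is a one-line function. In the sketch below the helper names and the small correlation matrix are hypothetical, and the clamp at zero is an added safeguard against rounding errors that push $\rho$ slightly above 1.

```python
from math import sqrt

def correlation_distance(rho):
    """d = sqrt(2 (1 - rho)), Eq. (33); the argument is clamped at zero."""
    return sqrt(max(0.0, 2.0 * (1.0 - rho)))

def distance_matrix(corr):
    """Turn a correlation matrix (list of lists) into a distance matrix."""
    return [[correlation_distance(r) for r in row] for row in corr]

# Limiting cases: perfectly correlated, uncorrelated, anticorrelated.
print(correlation_distance(1.0), correlation_distance(0.0), correlation_distance(-1.0))

# Hypothetical 3 x 3 correlation matrix.
corr = [[1.0, 0.6, 0.1],
        [0.6, 1.0, 0.3],
        [0.1, 0.3, 1.0]]
for row in distance_matrix(corr):
    print([round(d, 3) for d in row])
```

The resulting matrix has a null diagonal and is symmetric, so it can be used directly as the matrix $\delta_{ij}$ entering the linkage formulas.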

FIG. 11: Distance matrix $d(X, Y)$ computed for the year 1998:
each element is displayed in a color scale ranging from blue
($d = 0$) to red ($d = \sqrt{2}$).

Figure 12 shows the dendrograms obtained by clustering
the stocks yearly from 1998 to 2002, with the single,
complete and Hausdorff linkage. Some considerations are
in order. As expected, the single linkage algorithm suffers
from the chaining effect [3], which yields elongated
clusters: different points merge into a large cluster almost
one at a time during the iterative procedure, with
the result of obtaining a poorly defined tree structure,
as can be clearly observed in Fig. 12 (from $s_{98}$ to
$s_{02}$). Wherever one chooses to cut the dendrogram,
no meaningful partition emerges from the
hierarchical tree. On the other hand, the dendrograms
obtained by means of both the complete and Hausdorff
algorithms show clear inner structures, corresponding to
the branches of the hierarchical tree. One recognizes the
clusters corresponding to homogeneous (from the industrial
viewpoint) groups of companies, belonging to the
same industrial area: this is the case of the money center
banks {C, JPM, AXP}, retail companies {HD, WMT},
companies dealing with basic materials {AA, IP, DD},
and the technological core {IBM, INTC, MSFT}.

The classification of stocks in terms of their economic 
homogeneity as well as the presence of superclusters and 
homogeneous subgroups was already discussed in [3] and 
will not be analyzed here. However, there are
characteristic features of the dendrograms that deserve
additional attention. An interesting phenomenon, consisting
in "backsteps" in the dendrograms, sometimes appears
in the Hausdorff clustering, as shown in $h_{02}$ of Fig. 12,
the dendrogram obtained by clustering the financial time
series in 2002. This pattern is mathematically spelled
out in Appendix B, where its significance is elucidated in
terms of an elementary example (see Fig. 13). We take
this phenomenon as an indicator of the potentialities of
a clustering algorithm based on the Hausdorff distance,
that could be exploited in a non-hierarchical algorithm,
allowing backsteps and hierarchy breaking.



FIG. 12: Dendrograms obtained by clustering the stocks from 1998 to 2002 for: single linkage (from S98 to S02), complete linkage
(from C98 to C02) and Hausdorff linkage (from h98 to h02). The acronyms are explained in Appendix A. Some "backsteps" can
be clearly observed in h02. A mathematical explanation of this phenomenon is given in Appendix B.

VI. SUMMARY

Clustering is a common practice in the analysis of complex
data and reflects a human compulsion towards classifying
objects or physical phenomena. This can be a difficult task
when the phenomena are complicated and the underlying
correlations are difficult to bring to light. We have
introduced and analyzed a clustering procedure based on a
bona fide distance introduced by Hausdorff. The method,
which relies on an underlying distance among the elements
that make up the "parent" set, has been compared with both
the single and complete linkage procedures, which rely only
on an underlying dissimilarity measure (not a distance). We
first looked at a toy problem, in which the Hausdorff method
has evident advantages over the other two. We then clustered
the financial time series of the DJIA stock market,
observing the formation of clusters of "homogeneous"
companies: the results obtained are significant from an
economic point of view.
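The agglomerative procedure summarized above can be sketched in a few lines. This is only a naive illustration, not the implementation used for the results in this paper: the function names `hausdorff` and `hausdorff_linkage`, the `dist` parameter and the four sample points are assumptions introduced here, and the Hausdorff distance is written in its usual max-min form for finite sets.

```python
import math
from itertools import combinations


def hausdorff(X, Y, dist=math.dist):
    """Hausdorff distance between two finite sets, given an element distance."""
    forward = max(min(dist(x, y) for y in Y) for x in X)
    backward = max(min(dist(x, y) for x in X) for y in Y)
    return max(forward, backward)


def hausdorff_linkage(points, dist=math.dist):
    """Naive agglomerative clustering; returns [(cluster, cluster, height), ...]."""
    clusters = [(p,) for p in points]  # start from singleton clusters
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest Hausdorff distance.
        height, i, j = min(
            (hausdorff(clusters[i], clusters[j], dist), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical data: two well-separated pairs of points on a line.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
merges = hausdorff_linkage(points)
for left, right, height in merges:
    print(left, right, height)
```

The recorded heights are not forced to be monotone, so a sequence of merges produced this way may contain the backsteps discussed in Appendix B.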

An important application of the method introduced here is
certainly in portfolio optimization [3], [TH], [HI], [20],
[21], where the key issue is to select one (or a few) stocks
that are representative of a given cluster, characterized by
economic homogeneity, reducing maintenance costs and
optimizing risk. Among the possible future developments, one
should test the stability of the method against noise
effects [22, 23] and endeavor to understand the practical
consequences of hierarchy breaking due to the backsteps
discussed in the previous section.



APPENDIX A: DOW JONES STOCK MARKET COMPANIES

AA: Alcoa Inc. - Basic Materials

AXP: American Express Co. - Financial

BA: Boeing - Capital Goods

C: Citigroup - Financial

CAT: Caterpillar - Capital Goods

DD: DuPont - Basic Materials

DIS: Walt Disney - Services

EK: Eastman Kodak - Consumer Cyclical

GE: General Electric - Conglomerates

GM: General Motors - Consumer Cyclical

HD: Home Depot - Services

HON: Honeywell International - Capital Goods

HPQ: Hewlett-Packard - Technology

IBM: International Business Machines - Technology

INTC: Intel Corporation - Technology

IP: International Paper - Basic Materials

JNJ: Johnson & Johnson - Healthcare

JPM: JP Morgan Chase - Financial

KO: Coca-Cola Co. - Consumer Non-Cyclical

MCD: McDonald's Corp. - Services

MMM: Minnesota Mining - Conglomerates

MO: Philip Morris - Consumer Non-Cyclical

MRK: Merck & Co. - Healthcare

MSFT: Microsoft - Technology

PG: Procter & Gamble - Consumer Non-Cyclical

SBC: SBC Communications - Services

T: AT&T - Services

UTX: United Technologies - Conglomerates

WMT: Wal-Mart Stores - Services

XOM: Exxon Mobil - Energy

APPENDIX B



FIG. 13: Example of a backstep in the Hausdorff linkage.
Given three sets A (a segment), B (another segment) and C
(a "U"), the Hausdorff linkage algorithm links A and B at
a distance $d_H(A, B)$, then links $A \cup B$ and C at a
distance $d_H(A \cup B, C) < d_H(A, B)$. The set C is nearer
to $A \cup B$ than it is to A and B separately. The
corresponding dendrogram is drawn below.
We explain here the phenomenon of the backsteps observed in
the Hausdorff dendrogram of Fig. 12 (see panel h02) and
argue that the Hausdorff hierarchical clustering does not
exploit the full potential of the Hausdorff distance.

Let us consider the three compact sets of the Euclidean
plane shown in Fig. 13. Set A is a segment, B is another
segment and C is a polygonal "U". They are arranged in such
a way that

$$d_H(A, B) < d_H(A, C), \quad \text{and} \quad d_H(A, B) < d_H(B, C). \tag{B1}$$

Therefore, the Hausdorff linkage algorithm starts off by
linking A and B at a distance $d_H(A, B)$ into a cluster
$D = A \cup B$. But now it happens that the Hausdorff
distance between C and cluster D is smaller than the
Hausdorff distance between A and B, namely

$$d_H(D, C) = d_H(A \cup B, C) < d_H(A, B). \tag{B2}$$

Therefore, the set C is nearer to $D = A \cup B$ than it is
to A and B separately,

$$d_H(A \cup B, C) < d_H(A, C), \; d_H(B, C), \tag{B3}$$

and the corresponding dendrogram exhibits a backstep.
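A small numerical check may make relations (B1)-(B3) concrete. The sketch below does not reproduce the segments and the "U" of Fig. 13: it uses hypothetical finite point sets chosen here only because they satisfy the same inequalities, and it assumes the usual max-min form of the Hausdorff distance for finite sets.

```python
import math


def directed_hausdorff(X, Y):
    """Largest distance from a point of X to its nearest point of Y."""
    return max(min(math.dist(x, y) for y in Y) for x in X)


def hausdorff(X, Y):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(X, Y), directed_hausdorff(Y, X))


# Hypothetical stand-ins for the sets A, B and C (not the sets of Fig. 13).
A = [(0.0, 0.0)]
B = [(2.0, 0.0)]
C = [(0.0, 1.0), (2.0, 1.0)]
D = A + B  # the cluster D = A U B formed at the first step

d_AB = hausdorff(A, B)  # 2.0
d_AC = hausdorff(A, C)  # sqrt(5), set by the point of C far from A
d_BC = hausdorff(B, C)  # sqrt(5), by symmetry
d_DC = hausdorff(D, C)  # 1.0

print("(B1) holds:", d_AB < d_AC and d_AB < d_BC)
print("(B2) holds:", d_DC < d_AB)
print("(B3) holds:", d_DC < min(d_AC, d_BC))
```

Each point of C sits close to one of the two points of D, but far from the other one, which is why C is close to the union while remaining far from A and B taken separately.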



It can therefore happen that two sets, after their
aggregation, become Hausdorff-closer to a third set than
they were separately. This explains (from a mathematical
viewpoint) the phenomenon of the backsteps observed in
Fig. 12 (see panel h02).

Therefore, backsteps are a direct consequence of the very
definition of the Hausdorff distance. The existence of
backsteps implies that $d_H$ cannot be used as the Hausdorff
hierarchy's aggregation index. Indeed, an aggregation index
is a positive function $f$ defined on the hierarchy Y
satisfying (i) $f(y) = 0$ if and only if y is reduced to a
single element of S and (ii) $f(y) < f(y')$ if
$y \subset y'$. Equation (B3) is at variance with condition
(ii). On the other hand, the complete and single
hierarchical algorithms generate hierarchies indexed through
$d_c$ and $d_s$, respectively. Nonetheless, the Hausdorff
hierarchy can be indexed through a proper choice of the
aggregation index $f$. This will be clarified in a
forthcoming article. From a more intuitive (physical)
perspective, condition (B3) can become valid when the sets
are rather intertwined, and can be taken as an indication
that, although always mathematically consistent, the
clustering procedure itself becomes doubtful at this level
of the dendrogram, in particular for inherently complex
problems, such as that of clustering stock market companies.
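Condition (ii) can be checked mechanically on a small hierarchy. The sketch below is an illustration only: the helper `violations` and the numerical heights are assumptions made here, chosen so that the Hausdorff-like index has a backstep (the union of A, B and C sits lower than the union of A and B) while the second index is monotone.

```python
def violations(index):
    """Pairs (y, y') with y a proper subset of y' but f(y) >= f(y')."""
    return [
        (y, z)
        for y in index
        for z in index
        if y < z and index[y] >= index[z]  # "<" is proper inclusion for frozensets
    ]


# Hypothetical heights with a backstep: f({A,B}) = 2 but f({A,B,C}) = 1.
backstep_index = {
    frozenset("A"): 0.0,
    frozenset("B"): 0.0,
    frozenset("C"): 0.0,
    frozenset("AB"): 2.0,
    frozenset("ABC"): 1.0,
}

# Hypothetical monotone heights on the same hierarchy.
monotone_index = dict(backstep_index)
monotone_index[frozenset("ABC")] = 3.0

print(violations(backstep_index))  # one offending pair: {A, B} inside {A, B, C}
print(violations(monotone_index))  # no offending pair
```

A function that returns an empty list on every hierarchy produced by a linkage rule is a candidate aggregation index for that rule; the first index above fails this test.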
nontrivial cases, some clusters must eventually acquire a
nonvanishing distance from themselves at some iteration.
For non-compact sets whose subsets are uniformly distributed
with linear density $a$, a subset is "small" if its size
$r \ll a^{-1}$.

This is motivated by thinking of unbounded sets: the
Hausdorff distance between a point $A = \{a\}$ and a set B
having an accumulation point at infinity (such as a straight
line) is indefinitely large, for no open r-neighborhood
$N_r(A)$ will ever contain B, no matter how large r.
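The growth of the distance can be seen numerically by replacing the straight line with longer and longer sampled segments. This is a sketch under assumptions made here: the unit sampling step, the half-lengths and the max-min form of the Hausdorff distance for finite sets are all illustrative choices.

```python
import math


def hausdorff(X, Y):
    """Symmetric Hausdorff distance between two finite point sets."""
    forward = max(min(math.dist(x, y) for y in Y) for x in X)
    backward = max(min(math.dist(x, y) for x in X) for y in Y)
    return max(forward, backward)


def sampled_segment(half_length):
    """Integer-spaced points of the segment [-half_length, half_length] on the x axis."""
    return [(float(x), 0.0) for x in range(-half_length, half_length + 1)]


point = [(0.0, 0.0)]  # the one-point set A = {a}
half_lengths = [1, 10, 100]
distances = [hausdorff(point, sampled_segment(L)) for L in half_lengths]

for L, d in zip(half_lengths, distances):
    print(L, d)  # the distance equals the half-length of the segment
```

The distance is set by the endpoint farthest from the point, so it grows linearly with the length of the segment and has no finite limit for the full line.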



